
[MoE Refactor][14/N] Clean Up FI Quant Config Smuggling#31593

Merged
robertgshaw2-redhat merged 62 commits into main from fix-flashinfer-experts-quant-config-hack
Jan 6, 2026
Conversation

@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat commented Jan 1, 2026

Purpose

  • MoE refactor: FlashInfer smuggles scales for certain kernels via NVFP4 global scales; this PR cleans that up.
  • Potentially fixes issues with FlashInfer per-tensor for non-ModelOpt checkpoints. It avoids crashing Mixtral, but we still see 0% accuracy on Mixtral; that will be addressed in a follow-up PR.

Test Plan

```just
# autofp8
MODEL_BLOCK := "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
# MODEL_TENSOR := "amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV"

# modelopt
MODEL_TENSOR := "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"

GPUS := "2"
PORT := "8001"

# sm90
launch_cutlass_block:
	VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=throughput vllm serve {{MODEL_BLOCK}} -tp {{GPUS}} --port {{PORT}}

# sm90
launch_cutlass_tensor:
	VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=throughput vllm serve {{MODEL_TENSOR}} -tp {{GPUS}} --port {{PORT}} --max-model-len 8192

# sm100
launch_trtllm_block:
	VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency chg run --gpus {{GPUS}} -- vllm serve {{MODEL_BLOCK}} -tp {{GPUS}}

# sm100
launch_trtllm_tensor:
	VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency chg run --gpus {{GPUS}} -- vllm serve {{MODEL_TENSOR}} -tp {{GPUS}} --max-model-len 8192

eval_block:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL_BLOCK}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False"

eval_tensor:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL_TENSOR}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False"
```
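For context, the eval recipes drive vLLM's OpenAI-compatible `/v1/completions` endpoint. Here is a minimal sketch of the request body that an evaluation client posts to that endpoint (model name and port taken from the recipes above; the helper function is invented for illustration, and actually sending the request requires a running server):

```python
# Hypothetical sketch: build the JSON body for a POST to vLLM's
# OpenAI-compatible completions endpoint (http://localhost:8001/v1/completions).
# Only the request construction is shown; nothing is sent over the network.
import json


def build_completion_request(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, typical for gsm8k-style evals
    }
    return json.dumps(payload).encode("utf-8")


body = build_completion_request(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "Q: What is 12 * 7?\nA:",
)
print(json.loads(body)["model"])
```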

Test Result

Llama4 Scout

  • cutlass tensor (h100 / b200)
- h100
local-completions (model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8,base_url=http://localhost:8002/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9227|±  |0.0074|
|     |       |strict-match    |     5|exact_match||0.9075|±  |0.0080|
  • trtllm tensor (b200)
local-completions (model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8,base_url=http://localhost:8000/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9242|±  |0.0073|
|     |       |strict-match    |     5|exact_match||0.9075|±  |0.0080|

Qwen3-30B

  • cutlass block (h100)
local-completions (model=Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8,base_url=http://localhost:8001/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8749|±  |0.0091|
|     |       |strict-match    |     5|exact_match||0.8855|±  |0.0088|
  • trtllm block (b200)
local-completions (model=Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8,base_url=http://localhost:8000/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8787|±  |0.0090|
|     |       |strict-match    |     5|exact_match||0.8931|±  |0.0085|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Robert Shaw added 3 commits December 31, 2025 22:32
@mergify bot added the llama (Related to Llama models) and nvidia labels Jan 1, 2026
Robert Shaw added 6 commits January 1, 2026 00:45
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the quantization configuration for FlashInfer experts, removing a hack and improving code clarity. The changes appear correct and align with the goal of cleaning up the implementation. However, the PR includes several debugging statements (e.g., logger.info, print, timing code) across multiple files. These should be removed before merging to avoid polluting logs and potential performance impacts. Additionally, there is a critical issue in vllm/model_executor/models/llama4.py where a logging statement references a variable defined in a commented-out block, which will cause a NameError at runtime.

I am having trouble creating individual review comments; my feedback is below.

vllm/model_executor/models/llama4.py (524-539)

critical

This block contains commented-out debugging code and an active logging statement that references start, a variable defined within the commented-out section. This will lead to a NameError at runtime. The entire block should be removed.
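The failure mode described here is a common one when debugging code is commented out piecemeal. A minimal, hypothetical repro (names invented, not vLLM's actual code):

```python
# Repro of the bug class flagged above: a log line references a variable
# that is only defined inside commented-out debugging code, so the
# function raises NameError at runtime instead of logging a timing.
import time


def forward_with_stale_debug_log() -> int:
    # start = time.perf_counter()  # the debug timing setup was commented out...
    result = sum(range(10))
    # ...but this log line still references `start`, so it raises NameError.
    print(f"forward took {time.perf_counter() - start:.3f}s")
    return result


try:
    forward_with_stale_debug_log()
except NameError as exc:
    print(f"crashed: {exc}")  # crashed: name 'start' is not defined
```

Removing the whole block (both the commented-out setup and the live log line), as the review suggests, avoids the crash.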

vllm/model_executor/layers/fused_moe/layer.py (954-956)

high

This logging statement appears to be for debugging purposes and should be removed before merging.

vllm/model_executor/layers/fused_moe/layer.py (1012-1021)

high

This block of code, including the time import and performance logging, seems to be for debugging and should be removed.

vllm/model_executor/layers/fused_moe/layer.py (1274)

high

This logger.info call appears to be for debugging. Please remove it, along with similar debugging logs at lines 1292, 1304, and the commented-out log at line 1366.

vllm/model_executor/models/mllama4.py (1126)

high

This print statement appears to be for debugging. Please remove it, along with the other debug prints in this function at lines 1130, 1136, and 1141.

vllm/model_executor/models/utils.py (198)

high

This logger.info call appears to be for debugging. Please remove it, along with the other debug logs added in this file (lines 219, 255, 264, 292, 302, 334).

Robert Shaw added 5 commits January 1, 2026 14:56
@robertgshaw2-redhat
Collaborator Author

Now working e2e with FlashInfer CUTLASS.

Need to make a few more small fixes for FlashInfer TRTLLM.

Robert Shaw added 10 commits January 1, 2026 16:43
Robert Shaw and others added 5 commits January 5, 2026 14:04
@robertgshaw2-redhat
Collaborator Author

Just reran the quality checks on top of the head commit after the nits; accuracy still looks good.

Collaborator

@pavanimajety pavanimajety left a comment


Thank you for making the changes.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 5, 2026
@robertgshaw2-redhat
Collaborator Author

> Thank you for making the changes.

Thanks for your great feedback and review @pavanimajety !

```python
w2_scale=layer.w2_weight_scale,
a1_scale=layer.w13_input_scale,
a2_scale=layer.w2_input_scale,
a1_gscale=(1.0 / layer.w13_input_scale),
```
Contributor

@amirkl94 amirkl94 Jan 5, 2026


I think this function is called every forward, which means these 2 lines will result in 2 kernel launches for reciprocal:

```python
a1_gscale=(1.0 / layer.w13_input_scale),
a2_gscale=(1.0 / layer.w2_input_scale),
```

Can we add these 2 scales in process_weights_after_loading ?

Collaborator Author

@robertgshaw2-redhat robertgshaw2-redhat Jan 5, 2026


It's not called in the forward pass. I recognize this is confusing, but the apply() method is not invoked during the forward pass for the FlashInfer kernels: when FlashInfer CUTLASS kernels are selected, the FpMoeMethod is converted into a ModularKernelMethod.

I am working on an ongoing refactor that makes the conversion

see https://vllm-dev.slack.com/archives/C08NFPURQ1F/p1767650816469009 for more details on my efforts
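Whether or not apply() sits on the hot path, the precomputation suggested above can be sketched as follows. This is a minimal, hypothetical illustration (class names and structure are invented; attribute names mirror the snippet under review, and plain floats stand in for GPU tensors):

```python
# Sketch of the suggested fix: compute the reciprocal global scales once in
# process_weights_after_loading and cache them on the layer, instead of
# evaluating 1.0 / scale on every apply() call (which, on GPU tensors, would
# launch a reciprocal kernel each time). Illustrative only, not vLLM's code.


class LayerSketch:
    """Stand-in for a quantized MoE layer; floats replace GPU tensors."""

    def __init__(self, w13_input_scale: float, w2_input_scale: float):
        self.w13_input_scale = w13_input_scale
        self.w2_input_scale = w2_input_scale


class MoEMethodSketch:
    def process_weights_after_loading(self, layer: LayerSketch) -> None:
        # One-time reciprocal at load time, cached for all later calls.
        layer.a1_gscale = 1.0 / layer.w13_input_scale
        layer.a2_gscale = 1.0 / layer.w2_input_scale

    def apply(self, layer: LayerSketch) -> dict:
        # The hot path reuses the cached values; no per-call division.
        return {
            "a1_scale": layer.w13_input_scale,
            "a2_scale": layer.w2_input_scale,
            "a1_gscale": layer.a1_gscale,
            "a2_gscale": layer.a2_gscale,
        }


layer = LayerSketch(w13_input_scale=0.5, w2_input_scale=0.25)
method = MoEMethodSketch()
method.process_weights_after_loading(layer)
print(method.apply(layer)["a1_gscale"])  # 2.0
```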

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) January 6, 2026 04:34
@github-project-automation github-project-automation bot moved this to Backlog in MoE Refactor Jan 6, 2026
@robertgshaw2-redhat robertgshaw2-redhat moved this from Backlog to Ready in MoE Refactor Jan 6, 2026
@robertgshaw2-redhat robertgshaw2-redhat moved this from Ready to In progress in MoE Refactor Jan 6, 2026
@robertgshaw2-redhat robertgshaw2-redhat moved this from In progress to In review in MoE Refactor Jan 6, 2026
@robertgshaw2-redhat robertgshaw2-redhat merged commit af8fd73 into main Jan 6, 2026
62 checks passed
@robertgshaw2-redhat robertgshaw2-redhat deleted the fix-flashinfer-experts-quant-config-hack branch January 6, 2026 15:47
@github-project-automation github-project-automation bot moved this from In review to Done in MoE Refactor Jan 6, 2026
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 6, 2026
LucasWilkinson pushed a commit to neuralmagic/vllm that referenced this pull request Jan 6, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

llama (Related to Llama models), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

4 participants